RenderFlow
Single-Step Neural Rendering via Flow Matching
Zhang et al.
Presented by Manish Mathai
March 5, 2026
Scene
Geometry
Materials
Lights
- Geometry: surfaces made of millions of tiny triangles
- Materials: color, roughness, metallic – how surfaces look
- Lights: environment maps, point lights, the sun
Path Tracing
- The gold standard for realism
- Trace rays of light as they bounce around the scene
- Each bounce samples a random direction, so thousands of rays are needed to converge
- Problem: very expensive
- Convergence can take minutes to hours per image
- Need many samples (light rays) per pixel to reduce noise
Samples vs. Quality
1 SPP (top-left) to 32,768 SPP (bottom-right), doubling left to right.
What if we could skip the expensive sampling entirely?
Rasterization
- Project triangles to pixels instead of tracing rays
- First pass: Store scene properties as images called G-buffers:
- Albedo (base color), Normals, Depth, Roughness, Metallic
- Second pass: Combines these for fast shading
- No global effects: soft shadows, reflections, and indirect light are faked with cheaper, rough approximations
- This pipeline is called deferred rendering
The Gap
Path Tracing
- Physically accurate
- Minutes to hours per frame
- The “ground truth”
Deferred Rendering
- Real-time (milliseconds)
- Misses global illumination
- G-buffers are cheap to produce
Path Tracing vs Rasterization
The Quality Gap: Interiors
Rasterization (left) vs path tracing (right)
The Quality Gap: Subtle Details
Best of both worlds?
Can we get path-tracing quality from G-buffers… using a neural network?
Prior Work: Diffusion Models
- Models like Stable Diffusion, DALL-E learn to generate images from noise
- Forward process: gradually add noise to an image until it’s pure static
- Reverse process: a neural network learns to undo the noise, step by step
- Typically needs 20-50 denoising steps to produce a clean image
- What if we condition the reverse process on G-buffers to guide it toward a rendered image?
RGB-X (SIGGRAPH 2024)
- Condition a diffusion model on G-buffers to synthesize realistic images
- Estimates intrinsic channels: the surface properties like albedo, normals, etc.
- Also works in reverse: RGB image -> G-buffer decomposition
- ~50 denoising steps, ~2.2 seconds per frame
DiffusionRenderer (CVPR 2025)
- Extends the idea to video using a video diffusion model
- Handles temporal consistency across frames
- Trained on synthetic + auto-labeled real-world data
- Enables relighting, material editing, and object insertion from a single video
- ~30 denoising steps, ~1.4 seconds per frame
But Two Problems…
- Slow: 20-50 denoising steps per frame
- RGB-X: ~2.2 seconds per frame
- DiffusionRenderer: ~1.4 seconds per frame
- Not real-time (need < 33ms for 30fps)
- Stochastic: different random seeds produce different results
- Flickering between frames: shadows appearing and disappearing, lighting and reflections shifting
- Not reproducible, making it bad for production pipelines
Flow Matching
- Alternative to diffusion: learn a velocity field that transports samples from source to target distribution
- Deterministic: follows a deterministic ODE instead of a stochastic process
- The same input gives the same output every time
- Rectified flow: encourages straight-line trajectories between paired samples
- Straight paths incur zero discretization error – the ODE can be solved in a single Euler step
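A toy numpy sketch of why straight trajectories admit single-step solving (the paired samples and shapes here are made up for illustration; this is not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity_target(z0, z1):
    # Rectified flow: along the straight path z_t = (1-t) z0 + t z1,
    # the target velocity is constant: dz/dt = z1 - z0.
    return z1 - z0

# Toy paired samples: source (e.g. albedo latent) and target (rendered latent).
z0 = rng.normal(size=(4, 8))
z1 = z0 + 2.0                  # hypothetical paired target

# A perfectly trained model would predict v = z1 - z0 everywhere on the path.
v = velocity_target(z0, z1)

# One Euler step from t=0 to t=1 lands exactly on the target,
# because a straight line incurs zero discretization error.
z_pred = z0 + 1.0 * v
assert np.allclose(z_pred, z1)
```

With curved (diffusion-like) trajectories the same one-step Euler update would overshoot, which is why those models need many steps.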
RenderFlow: Key Idea
- Learn a single-step flow: G-buffers -> rendered image
- Key insight: replace noise with albedo as the starting point
- Already spatially aligned with the target. It has the right colors, textures, structure
- Network only learns the residual: shadows, reflections, global illumination
- Much smaller “distance” than noise -> image. A single step suffices
- Single forward pass: ~0.19s per frame (10x faster than diffusion methods)
How to Train It: Bridge Matching
- Pure flow matching trains on exact straight-line paths. That can be brittle
- Bridge matching: add small noise perturbations during training only
- \(z_t = (1-t)z_0 + tz_1 + \sigma\sqrt{t(1-t)}\epsilon\)
- \(z_0\) = albedo, \(z_1\) = rendered image, \(\sigma\) = noise scale, \(\epsilon\) = random noise
- Acts as a regularizer and the model sees diverse variations of the path
- When \(\sigma = 0\), reduces to pure flow matching
- Result: more robust to variations in lighting and materials
- Inference remains deterministic – noise is a training trick only
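The bridge-matching interpolant above can be sampled directly from its formula; a minimal sketch (toy latents, shapes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

def bridge_sample(z0, z1, t, sigma, rng):
    """z_t = (1-t) z0 + t z1 + sigma * sqrt(t(1-t)) * eps."""
    eps = rng.normal(size=z0.shape)
    return (1 - t) * z0 + t * z1 + sigma * np.sqrt(t * (1 - t)) * eps

z0 = rng.normal(size=(4, 8))   # albedo latent (toy)
z1 = rng.normal(size=(4, 8))   # rendered-image latent (toy)

# The perturbation vanishes at both endpoints: sqrt(t(1-t)) = 0 at t=0 and t=1.
assert np.allclose(bridge_sample(z0, z1, 0.0, 0.1, rng), z0)
assert np.allclose(bridge_sample(z0, z1, 1.0, 0.1, rng), z1)

# With sigma = 0 the bridge reduces to the pure flow-matching straight line.
zt = bridge_sample(z0, z1, 0.5, 0.0, rng)
assert np.allclose(zt, 0.5 * z0 + 0.5 * z1)
```

Since the noise only perturbs intermediate training states, inference at t=0 with a learned velocity stays deterministic.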
Train Multi-Step, Infer Single-Step
- Train with bridge matching at 4 discrete timesteps [1.0, 0.75, 0.5, 0.25], \(\sigma = 0.005\)
- But infer in 1 step. Just one forward pass
- Why does this work?
- Multi-step training exposes the model to intermediate states
- Single-step inference avoids error accumulation across steps
| Training | Inference | PSNR ↑ |
|---|---|---|
| 4-step ODE | 4 steps | 23.09 |
| 4-step ODE | 1 step | 23.30 |
| 4-step SDE | 4 steps | 23.38 |
| 4-step SDE | 1 step | 23.59 |
1-step inference outperforms multi-step because fewer steps mean less error accumulation
PSNR (Peak Signal-to-Noise Ratio): higher = better. Ablation at 256x256.
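For reference, PSNR is a simple function of mean squared error; a minimal sketch (the uniform-error test image is made up):

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB; higher means closer to the reference."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

target = np.full((16, 16), 0.5)
pred = target + 0.01                  # uniform error of 0.01 -> MSE = 1e-4
print(round(psnr(pred, target), 1))   # 40.0
```

A ~1 dB PSNR gap, as between the ODE and SDE rows above, corresponds to a noticeable reduction in average error.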
Architecture
- Repurposes a pretrained video diffusion model (Wan2.1, 1.3B parameter DiT)
- Albedo replaces noise input; text cross-attention removed
- All inputs (G-buffers + environment map) encoded by VAE into latent space
- G-buffer tokens added element-wise to albedo tokens (spatially aligned: same pixel locations)
- Envmap Adapter: environment map injected via adaptive normalization (scale + shift) because not spatially aligned like G-buffers
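A rough sketch of scale-and-shift conditioning for a non-spatial input like an environment map (the projection matrices `W_scale`/`W_shift` and all shapes are hypothetical, not the paper's actual adapter):

```python
import numpy as np

def adaptive_norm(tokens, env_embedding, W_scale, W_shift, eps=1e-5):
    """Inject a non-spatial condition via per-channel scale and shift.

    tokens:        (n_tokens, d)  spatially aligned features
    env_embedding: (d_env,)       pooled environment-map embedding
    """
    # Normalize token features (layer-norm style, no learned affine here).
    mu = tokens.mean(axis=-1, keepdims=True)
    var = tokens.var(axis=-1, keepdims=True)
    normed = (tokens - mu) / np.sqrt(var + eps)
    # Scale and shift are predicted from the environment embedding.
    scale = env_embedding @ W_scale   # (d,)
    shift = env_embedding @ W_shift   # (d,)
    return normed * (1 + scale) + shift

rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 8))
W_scale = rng.normal(size=(4, 8))
W_shift = rng.normal(size=(4, 8))
out = adaptive_norm(tokens, np.zeros(4), W_scale, W_shift)
# With a zero embedding, the layer reduces to plain normalization.
```

This is why adaptive normalization suits the environment map: it modulates every token globally instead of assuming pixel alignment.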
Training Losses
- Latent loss: bridge matching loss in VAE latent space (the core objective)
- Pixel losses applied after decoding back to image space:
- LPIPS: perceptual similarity (captures structural differences humans notice)
- Gradient loss: preserves high-frequency details like contact shadows
- Total: \(\mathcal{L}_{total} = \mathcal{L}_{latent} + \lambda \mathcal{L}_{pixel}\)
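The combined objective can be sketched as follows; `lpips_fn` stands in for a real LPIPS network (which needs pretrained weights), and the weight `lam` is a placeholder, not the paper's value:

```python
import numpy as np

def gradient_loss(pred, target):
    """L1 difference of finite-difference image gradients (preserves edges)."""
    dx = lambda img: img[:, 1:] - img[:, :-1]
    dy = lambda img: img[1:, :] - img[:-1, :]
    return (np.abs(dx(pred) - dx(target)).mean()
            + np.abs(dy(pred) - dy(target)).mean())

def total_loss(latent_pred, latent_target, img_pred, img_target,
               lpips_fn, lam=0.1):
    """L_total = L_latent + lambda * L_pixel."""
    l_latent = np.mean((latent_pred - latent_target) ** 2)  # bridge-matching MSE
    l_pixel = lpips_fn(img_pred, img_target) + gradient_loss(img_pred, img_target)
    return l_latent + lam * l_pixel

img = np.full((8, 8), 0.5)
lpips_stub = lambda a, b: 0.0   # stand-in; real LPIPS uses a pretrained net
loss = total_loss(img, img, img, img, lpips_stub)
```

The pixel terms only exist because predictions are decoded back to image space; the latent term alone cannot see VAE decoding artifacts.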
Video Inference
- Model trained on short clips (5 frames) for memory efficiency
- Long videos rendered in overlapping chunks:
- Last frame of chunk N becomes the conditioning frame for chunk N+1
- Promotes smooth transitions and temporal coherence
- Combined with keyframe guidance for best results
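The chunked rollout above might look like this in outline (`render_chunk` is a hypothetical stand-in for the model call; chunk length and conditioning protocol are simplified):

```python
def render_video(g_buffers, render_chunk, chunk_len=5):
    """Render a long sequence chunk by chunk.

    g_buffers:    list of per-frame G-buffer inputs (length N)
    render_chunk: model call (chunk_of_gbuffers, cond_frame) -> list of frames
    """
    frames, cond = [], None
    i = 0
    while i < len(g_buffers):
        chunk = g_buffers[i:i + chunk_len]
        out = render_chunk(chunk, cond)   # condition on previous chunk's end
        frames.extend(out)
        cond = out[-1]                    # last frame seeds the next chunk
        i += chunk_len
    return frames

# Toy check with an identity "model" that just echoes its inputs.
identity_model = lambda chunk, cond: list(chunk)
frames = render_video(list(range(12)), identity_model)
```

Carrying the last rendered frame forward is what lets a model trained on 5-frame clips stay temporally coherent over long videos.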
Keyframe Guidance
- G-buffers alone lack global lighting info (shadows, reflections are ambiguous)
- Solution: feed sparse path-traced keyframes as additional guidance
- e.g., one high-quality reference frame every 16 frames
- Keyframe Adapter: cross-attention branch injected into each transformer block
- Uses RoPE to encode temporal distance between keyframe and current frame
- Two-stage training: train base model first, freeze it, then train only the adapter
- Base performance unchanged when no keyframes are provided
Keyframe Guidance: Impact
| Keyframe interval | PSNR ↑ |
|---|---|
| No keyframes | 24.02 |
| Every 49 frames | 25.92 |
| Every 25 frames | 26.57 |
| Every 13 frames | 29.72 |
- Even sparse keyframes (every 49 frames) significantly outperform no guidance
- More keyframes = better quality, as expected
- Negligible speed impact (~0.24s vs ~0.19s per frame)
Ablation at 256x256 (Supplementary Table S1); full-resolution results in Table 1
Inverse Rendering
- Can we run the model backwards? RGB image -> G-buffers?
- Freeze the entire forward model, add lightweight adapters:
- LoRA on self-attention (same pattern as LLM fine-tuning)
- Cross-attention conditioned on a text prompt (“albedo”, “normal”, etc.)
- Per-intrinsic MLP heads for each output type
- One unified model handles both forward and inverse rendering
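The LoRA pattern mentioned above, in miniature (ranks, init scales, and shapes here are illustrative, not the paper's configuration):

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (rank r << d)."""
    def __init__(self, W, r=4, alpha=1.0, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        d_out, d_in = W.shape
        self.W = W                                    # frozen backbone weight
        self.A = rng.normal(0, 0.01, size=(r, d_in))  # trainable
        self.B = np.zeros((d_out, r))                 # trainable, zero-init
        self.alpha = alpha

    def __call__(self, x):
        # B starts at zero, so the adapted layer initially matches the frozen one.
        return x @ (self.W + self.alpha * self.B @ self.A).T

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 5))
x = rng.normal(size=(2, 5))
layer = LoRALinear(W)
out = layer(x)
```

Because only `A` and `B` (and the small per-intrinsic heads) train, the forward renderer's weights are untouched and both directions share one backbone.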
Results: Quantitative
Traditional baseline (not a neural method):
| Method | Type | Params | PSNR ↑ | LPIPS ↓ | Time (s/frame) |
|---|---|---|---|---|---|
| Deferred | Traditional | – | 24.65 | 0.097 | real-time |
Neural rendering methods:
| Method | Type | Params | PSNR ↑ | LPIPS ↓ | Time (s/frame) |
|---|---|---|---|---|---|
| RGB-X | Diffusion | 950M | 20.98 | 0.165 | ~2.19 |
| DiffusionRenderer | Diffusion | 1.7B | 23.76 | 0.128 | ~1.40 |
| Ours (w/o key) | Flow | 1.4B | 24.21 | 0.113 | ~0.19 |
| Ours (w/ key) | Flow | 1.7B | 26.66 | 0.101 | ~0.24 |
LPIPS (Learned Perceptual Image Patch Similarity): lower = better perceptual quality
- 10x faster than RGB-X, 7x faster than DiffusionRenderer
- Outperforms both neural baselines on all metrics, even without keyframes
- With keyframes: surpasses even traditional deferred rendering
Results: Deterministic
- RenderFlow: zero variance across runs (deterministic)
- Diffusion baselines: significant variance (stochastic)
- Same input always produces the exact same output
- Critical for production: no flickering, reproducible results
Results: Visual Comparison
Dataset
- No existing large-scale rendering dataset with G-buffers + environment maps
- Built a custom dataset using Unreal Engine 5 Movie Render Queue:
- Artist-crafted: 30,000 frames from professional scenes
- Procedural: 100,000 frames from randomly composed scenes
- 4,000 unique meshes, 30 HDR environment maps
- Randomized material attributes for diversity
- All rendered at 512x512, 256 SPP, denoised with Intel Open Image Denoise
- Both baselines (RGB-X, DiffusionRenderer) fine-tuned on this same dataset
Limitations
- VAE bottleneck: encoder/decoder accounts for ~90% of inference time
- The transformer itself is fast; the VAE is the constraint
- Dataset diversity: synthetic scenes only, limited lighting phenomena and geometric complexity
- Fails on highly complex geometries (fine-grained details lost in VAE compression)
- Temporal blurring: causal VAE convolution causes later frames to blur
- Initial frame stays sharp; subsequent frames progressively soften
- Resolution: trained and evaluated at 512x512 only
Discussion
- Key takeaway: flow matching + albedo starting point = single-step rendering
- Three contributions:
- Single-step flow-based rendering (10x faster, deterministic)
- Keyframe guidance adapter (significant quality boost)
- Inverse rendering via frozen backbone + adapters
- Open questions:
- Can the VAE bottleneck be eliminated?
- How does this scale to 1080p or 4K?
- Could this work with real-world captured scenes (not just Unreal Engine 5)?
- What about dynamic lighting changes within a sequence?
Thank You!
Questions and discussion